SGD is the widely adopted method for training CNNs. Conceptually, it approximates the population with a randomly sampled batch and then trains all batches evenly, performing one gradient update on every batch in an epoch. In this paper, we demonstrate that Sampling Bias, Intrinsic Image Difference, and Fixed Cycle Pseudo Random Sampling differentiate batches during training, which in turn affects the learning speed on each batch. Because of this, the unbiased treatment of batches in SGD creates improper load balancing. To address this issue, we present Inconsistent Stochastic Gradient Descent (ISGD), which dynamically varies the training effort according to the learning status of each batch. Specifically, ISGD leverages techniques from Statistical Process Control to identify an undertrained batch. Once a batch is identified as undertrained, ISGD solves a new subproblem, a chasing logic with a conservative constraint, to accelerate training on that batch while avoiding drastic parameter changes. Extensive experiments on a variety of datasets demonstrate that ISGD converges faster than SGD. In training AlexNet, ISGD is 21.05% faster than SGD in reaching 56% top-1 accuracy under exactly the same experimental setup. We also extend ISGD to multi-GPU and heterogeneous distributed systems based on data parallelism, making the batch size the key to scalability. We then study the effect of the ISGD batch size on the learning rate, parallelism, synchronization cost, system saturation, and scalability, and conclude that the optimal ISGD batch size is machine dependent. Various experiments on a multi-GPU system validate this claim. In particular, ISGD trains AlexNet to 56.3% top-1 and 80.1% top-5 accuracy in 11.5 hours with 4 NVIDIA TITAN X GPUs at a batch size of 1536.
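The abstract mentions that ISGD flags undertrained batches with Statistical Process Control. The sketch below illustrates one way such a test could look; the sliding window, the 3-sigma control limit, and the monitor class are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of an SPC-style "undertrained batch" test, assuming a
# sliding window of recent losses and a 3-sigma upper control limit.
import numpy as np

class BatchLossMonitor:
    """Tracks recent batch losses and flags outliers as undertrained batches."""

    def __init__(self, window=100, n_sigma=3.0):
        self.window = window      # number of recent losses kept for the control chart
        self.n_sigma = n_sigma    # control-limit width (assumed 3-sigma rule)
        self.losses = []

    def is_undertrained(self, batch_loss):
        """Return True if batch_loss exceeds the upper control limit."""
        if len(self.losses) >= self.window:
            mean = np.mean(self.losses)
            std = np.std(self.losses)
            flagged = batch_loss > mean + self.n_sigma * std
        else:
            flagged = False       # not enough history to form control limits yet
        self.losses.append(batch_loss)
        if len(self.losses) > self.window:
            self.losses.pop(0)
        return flagged

# Usage inside a training loop: a flagged batch would receive extra training
# effort (the paper's chasing subproblem), e.g.
#   if monitor.is_undertrained(loss.item()):
#       take_extra_steps(batch)   # hypothetical helper
```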